Kernel Principal Component Analysis (KPCA)
Kernel PCA (kPCA) is a non-linear dimensionality reduction technique (like feed-forward autoencoders). It is often used with imagery and other data for which a linear method cannot effectively separate the relevant clusters. The kernel component works the same way as a kernel in an SVM, using the same 'kernel trick' to perform the necessary calculations while avoiding a possibly infinite-dimensional feature space.
We will continue with a modified example from the notebook on PCA. Like PCA, kPCA has a specific set of requirements that must be met for the algorithm to provide useful results. As a general rule, you should attempt standard/linear PCA first and decide whether it effectively separates your data; only if that fails should a non-linear method be tried. Because kPCA has more free parameters, its requirements are somewhat stricter.
Requirements
- A large number of samples is required.
- The data must be fairly evenly distributed; otherwise kPCA will likely have a harder time finding an appropriate separation.
- The data must be non-linear (inferred from the failure of linear PCA or from domain knowledge).
- Data in the feature space (see the derivation below for the meaning of 'feature space') must be centered, as with PCA. Standardization is often required to obtain good results as well.
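One hedged way to apply the "linear first" rule is to fit plain PCA and inspect the explained variance (alongside a scatter plot of the projected classes) before reaching for kPCA. The sketch below uses random stand-in data; with a real dataset you would substitute your own matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in data; replace with your own data matrix in practice.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# Fit linear PCA and see how much variance two components capture.
pca = PCA(n_components=2)
pca.fit(X - X.mean(axis=0))
print(pca.explained_variance_ratio_.sum())
```

A high captured-variance ratio does not by itself prove the classes are separable, so a scatter plot of the projected data is still worth a look.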
Derivation
An interesting property of kPCA is that the computation time scales with the number of samples $n$ rather than with the input dimension, as is the case with PCA. So, if your dataset has many samples, computation time can become an issue.
First, we assume we have non-linear data samples $x_i$ forming the rows of our data matrix $X \in R^{n \times m}$. We will refer to $R^{m}$ as the input space, the space our original data resides in. Motivated by kernel-based SVM approaches, which find non-linear solutions in the input space, we apply a transformation $\phi : R^{m} \to R^{M}$ to each $x_i$. This new space holding our transformed data is referred to as the feature space. Note $M$ may be infinite. It is in this space that we apply the standard PCA algorithm. For the time being, we will assume the transformed data in the feature space is centered, though this is not necessarily true.
Following the steps of PCA, we calculate the feature-space covariance matrix: $$C \equiv \frac{1}{n-1} \phi(X)^T\phi(X) = \frac{1}{n-1} \sum \limits _{i=1} ^{n} \phi(x_i)^T \phi(x_i) \in R^{M \times M}$$ where $\phi(X) \in R^{n \times M}$ has rows $\phi(x_i)$, and solve the eigenvector/eigenvalue problem $$CV = V \Lambda$$
Note that for the $k$th eigenvector $v^{(k)}$: $$ \lambda_k v^{(k)} = Cv^{(k)} = \Bigg( \frac{1}{n-1} \sum \limits _{i=1} ^{n} \phi(x_i)^T \phi(x_i) \Bigg ) v^{(k)} = \frac{1}{n-1} \sum \limits _{i=1} ^{n} \big(\phi(x_i) v^{(k)}\big) \phi(x_i)^T$$
This means each eigenvector lies in the span of the transformed samples. In other words: $$v^{(k)} = \sum \limits _{i=1} ^{n} a_i \phi(x_i)^T $$ for some values $a_i$, $i = 1, \dots, n$.
If we multiply both sides of the eigenvalue equation on the left by an arbitrary $\phi(x_j)$ and substitute the expansion above, we obtain: $$(n-1)\lambda_k K \vec a = K^2 \vec a$$ or equivalently, $$(n-1)\lambda_k \vec a = K \vec a$$ with $K$ the kernel matrix, defined as the matrix of pairwise dot products between transformed data in the feature space: $$(K)_{ij} = \phi(x_i) \phi(x_j)^T \equiv K(x_i, x_j)$$
To summarize the above, if we have the inner product matrix/kernel matrix $K$ (or, equivalently, we know how to calculate inner products in the feature space), we can find its eigenvectors and hence calculate $\phi(x) v^{(k)}$ without ever calculating the arbitrarily large feature space covariance matrix and its eigenvectors. Note the quantity $\phi(x) v^{(k)}$ represents our principal components in the feature space: $$\phi(x) v^{(k)} = \sum \limits _{i=1} ^{n} a_i^{(k)} (\phi(x) \phi(x_i)^T) = \sum \limits _{i=1} ^{n} a_i^{(k)} K(x, x_i)$$
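To make the kernel trick concrete, here is a small illustrative sketch (not part of the derivation) using the degree-2 homogeneous polynomial kernel, whose feature map is known in closed form. For $x \in R^2$, $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ satisfies $\langle\phi(x), \phi(y)\rangle = (x \cdot y)^2$, so the inner products computed explicitly in the feature space match those computed with the kernel function alone:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))

# Explicit feature map for the degree-2 homogeneous polynomial kernel:
# phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), so <phi(x), phi(y)> = (x . y)^2.
def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

Phi = np.array([phi(x) for x in X])
K_explicit = Phi @ Phi.T    # dot products computed explicitly in feature space
K_trick = (X @ X.T) ** 2    # the same dot products via the kernel trick

print(np.allclose(K_explicit, K_trick))  # True
```

For kernels like the RBF, the feature space is infinite-dimensional, so only the kernel-trick side of this comparison is available.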
We also require the eigenvectors $v^{(k)}$ to have unit length in the feature space. This is equivalent to $v^{(k)T} v^{(k)} = 1$, which, writing $\lambda_k$ for the corresponding eigenvalue of $K$, yields the following requirement: $$ 1 = \lambda_k \langle a^{(k)}, a^{(k)}\rangle$$
Recall that we require centered data when performing PCA. In practice, we calculate a kernel matrix K using the kernel function $K(x_i, x_j)$ that implicitly defines the transformation and feature space. This data isn't necessarily centered in the feature space however. To satisfy this requirement, we must calculate the corresponding K for centered data and continue the algorithm like normal from there.
We must work with $\tilde{\phi}(x_i)$ where $$\tilde{\phi}(x_i) \equiv \phi(x_i) - \frac{1}{n}\sum \limits _{j=1} ^{n} \phi(x_j)$$
So, the $K$ we need, denoted $\tilde K$, is calculated as: $$\tilde K = \tilde{\phi}(X) \tilde{\phi}(X)^T$$
Through substitution, this can be shown to equal: $$\tilde K = K - 1_{\frac{1}{n}}K - K1_{\frac{1}{n}} + 1_{\frac{1}{n}}K1_{\frac{1}{n}}$$ where $1_{\frac{1}{n}}$ represents the $n \times n$ matrix with all entries equal to $\frac{1}{n}$.
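This centering identity can be checked numerically. The sketch below uses a random matrix as a stand-in for the transformed data $\phi(x_i)$ (possible here only because we pretend the feature map is explicit), and compares centering in the feature space directly against the formula above:

```python
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.normal(size=(6, 4))   # stand-in for transformed samples, one per row
n = Phi.shape[0]

K = Phi @ Phi.T

# Direct route: center in the feature space, then form the Gram matrix.
Phi_c = Phi - Phi.mean(axis=0)
K_direct = Phi_c @ Phi_c.T

# Formula route: K_tilde = K - 1n K - K 1n + 1n K 1n.
one_n = np.full((n, n), 1.0 / n)
K_formula = K - one_n @ K - K @ one_n + one_n @ K @ one_n

print(np.allclose(K_direct, K_formula))  # True
```

The two routes agree because $\tilde\phi(X) = (I - 1_{\frac{1}{n}})\phi(X)$, and expanding $(I - 1_{\frac{1}{n}})K(I - 1_{\frac{1}{n}})$ gives exactly the four-term formula.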
Application
It is fairly trivial to implement an example in $2D$, and many exist online (e.g. a cluster of points surrounded by a circle of points). In $3D$, the utility of kPCA similarly arises when the data cannot be separated in a linear fashion; more specifically, when no linear projection of the data effectively separates the classes of interest.
First, we will generate two classes of data residing on two different surfaces in $R^3$.
The following will produce a noisy plot of our data in $3D$. Try experimenting with the parameters and rotating/zooming with the interactive plot. Additionally, we assign classes $c1$ and $c2$ to the data, which are color coded.
# Imports
import scipy
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA, KernelPCA
import plotly.express as px
# Make the first class of data, a surface in R3.
X, Y = np.meshgrid(np.linspace(-5, 5, 50), np.linspace(-5, 5, 50))
Z = np.sqrt(X**2 + Y**2)
# Convert class 1 data into matrix of vectors.
C1 = np.vstack([X.ravel(), Y.ravel(), Z.ravel()]).T
# Add some noise.
C1 = C1 + np.random.normal(loc=0.0, scale=0.2, size = C1.shape)
# Assign class label 'c1' in a DataFrame.
df_c1 = pd.DataFrame(C1, columns = ['x', 'y', 'z'])
df_c1['class'] = ['c1']*len(df_c1)
# Now make the second class of data, a sphere in R3 with class label 'c2'.
radius = 2
center = (0, 0, 4)
P, T = np.meshgrid(np.linspace(0, 2*np.pi, 50),  # phi in spherical coordinates
                   np.linspace(0, np.pi, 50))    # theta in spherical coordinates
# Convert spherical coordinates to Cartesian coordinates.
x = radius * np.sin(T) * np.cos(P) + center[0]
y = radius * np.sin(T) * np.sin(P) + center[1]
z = radius * np.cos(T) + center[2]
C2 = np.vstack([x.ravel(), y.ravel(), z.ravel()]).T
C2 = C2 + np.random.normal(loc=0.0, scale=0.2, size = C2.shape)
df_c2 = pd.DataFrame(C2, columns = ['x', 'y', 'z'])
df_c2['class'] = ['c2']*len(df_c2)
# Combine everything into one DataFrame.
df = pd.concat([df_c1, df_c2])
fig = px.scatter_3d(df,
                    x = 'x',
                    y = 'y',
                    z = 'z',
                    title = 'Input data.',
                    color = 'class')
fig.show()
With more complex datasets, we will not fully understand their behavior or even necessarily know there is non-linearity present. So, we first try linear PCA and see what happens.
First, we subtract the mean from the data.
X = df[['x', 'y', 'z']].to_numpy()
X_mean_sub = X - np.mean(X, axis = 0)
pca = PCA(n_components = 2)
X_pca = pca.fit_transform(X_mean_sub)
df['pca_1'] = X_pca[:,0]
df['pca_2'] = X_pca[:,1]
fig = px.scatter(df,
                 x = 'pca_1',
                 y = 'pca_2',
                 title = 'PCA results.',
                 color = 'class')
fig.show()
We can see the data is not sufficiently separated; the two classes significantly overlap in the center.
We will now try kPCA by working out the solution long-hand from the theory described above. Note that, like PCA, kPCA also has an sklearn implementation (`KernelPCA`).
The kernel of choice is the RBF kernel, defined as follows: $$K(x, y) = \langle\phi(x), \phi(y)\rangle = e^{-\gamma||x - y||^2}$$ $$K: R^3 \times R^3 \to R \;\;\; \text{and} \;\;\; \gamma \in R$$
In general, one of the most difficult parts of applying kPCA is knowing which kernel to apply and with which parameters.
# First, decide on the kernel. We will use the RBF kernel with gamma = 1.
gamma = 1
def rbf(x, y, gamma = 0.5):
    norm_diff = np.linalg.norm(x - y)
    return np.exp(-1*gamma*norm_diff**2)
# Initialize K matrix.
K = np.zeros((X_mean_sub.shape[0], X_mean_sub.shape[0]))
# Fill in the upper triangle of the kernel matrix at all coordinates K_ij.
for i in range(len(X_mean_sub)):
    for j in range(i, len(X_mean_sub)):
        x = X_mean_sub[i] # shape (3,)
        y = X_mean_sub[j] # shape (3,)
        K[i, j] = rbf(x, y, gamma = gamma)
# Mirror into the lower triangle, subtracting the diagonal so it is not doubled.
K = K + K.T - np.diag(np.diag(K))
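As an aside, the double loop above is a direct transcription of the definition; an equivalent vectorized sketch (using `scipy.spatial.distance.cdist`, one of several ways to do this) computes the whole matrix at once:

```python
import numpy as np
from scipy.spatial.distance import cdist

def rbf_kernel_matrix(X, gamma=1.0):
    # K_ij = exp(-gamma * ||x_i - x_j||^2), computed for all pairs at once.
    return np.exp(-gamma * cdist(X, X, metric='sqeuclidean'))

# Small check against the scalar definition.
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K_demo = rbf_kernel_matrix(X_demo, gamma=1.0)
print(K_demo[0, 1])  # exp(-1), since ||x_0 - x_1||^2 = 1
```

The result is symmetric with ones on the diagonal, as any RBF kernel matrix must be.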
Before finding eigenvectors and eigenvalues, we must center the data in the feature space.
one_n = np.ones((X_mean_sub.shape[0], X_mean_sub.shape[0]))*(1/len(X_mean_sub))
K_tilde = K - one_n@K - K@one_n + one_n@K@one_n
Now, diagonalize $\tilde K$. From the above requirement for the normalization of $a^{(k)}$, we can write $$1 = \lambda_k (a^{(k)} \cdot a^{(k)}) \implies 1 = \lambda_k ||a^{(k)}||^2 \implies ||a^{(k)}|| = \frac{1}{\sqrt{\lambda_k}}$$ So we rescale the eigenvectors as follows $$a^{(k)} \leftarrow \frac{a^{(k)}}{||a^{(k)}||} \cdot \frac{1}{\sqrt{\lambda_k}}$$
eigenvalues, eigenvectors = scipy.sparse.linalg.eigs(K_tilde, 2, which = 'LR')
lambda_1 = eigenvalues[0].real
lambda_2 = eigenvalues[1].real
a_1 = eigenvectors[:, 0].real
a_2 = eigenvectors[:, 1].real
a_1_norm = (a_1/np.linalg.norm(a_1)) * (1/ np.sqrt(lambda_1))
a_2_norm = (a_2/np.linalg.norm(a_2)) * (1/ np.sqrt(lambda_2))
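As a quick, self-contained sanity check on this scaling (using a random symmetric PSD matrix as a stand-in for $\tilde K$, not the notebook's data), the rescaled vectors should satisfy $\lambda_k ||a^{(k)}||^2 = 1$:

```python
import numpy as np
import scipy.sparse.linalg

# Stand-in symmetric PSD matrix playing the role of a centered kernel matrix.
rng = np.random.default_rng(2)
M = rng.normal(size=(20, 20))
K_demo = M @ M.T

vals, vecs = scipy.sparse.linalg.eigsh(K_demo, k=2, which='LM')
checks = []
for lam, a in zip(vals, vecs.T):
    a_scaled = (a / np.linalg.norm(a)) / np.sqrt(lam)
    checks.append(lam * (a_scaled @ a_scaled))
print(checks)  # each entry is 1.0 up to floating point error
```

Since $\tilde K$ here is real and symmetric, `eigsh` is appropriate and returns real eigenpairs directly.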
Now we construct the principal components.
# Form matrix of eigenvectors.
A = np.hstack([a_1_norm.reshape(-1, 1), a_2_norm.reshape(-1, 1)])
# Project feature space samples onto the v-vectors via the kernel trick.
# Use the centered kernel matrix; it is symmetric, so no transpose is needed.
k_pcs = K_tilde @ A
Finally, we can view the data projected onto $2D$ space via the non-linear transformation.
df['kernelPCA_1'] = k_pcs[:, 0]
df['kernelPCA_2'] = k_pcs[:, 1]
fig = px.scatter(df,
                 x = 'kernelPCA_1',
                 y = 'kernelPCA_2',
                 title = 'Kernel PCA results.',
                 color = 'class')
fig.show()
This produces a plot similar to what scikit-learn's implementation will provide. Note the classes are separated much more effectively than in the linear PCA approach shown above.
Feel free to experiment with the code and try out a few different kernels to see if you can obtain better separation. Another popular kernel supported by scikit-learn is the polynomial kernel given by: $$K(x, y) = (\gamma x \cdot y + c)^d$$
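If you would rather experiment through scikit-learn than the long-hand code, a minimal sketch with the polynomial kernel looks like the following (the concentric-circles toy data and all parameter values are illustrative choices, not tuned recommendations):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Toy non-linear data: two concentric circles, a classic kPCA example.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Polynomial kernel K(x, y) = (gamma * x . y + coef0)^degree.
kpca = KernelPCA(n_components=2, kernel='poly', degree=3, gamma=1.0, coef0=1.0)
X_kpca = kpca.fit_transform(X)
print(X_kpca.shape)  # (200, 2)
```

`KernelPCA` also accepts `kernel='rbf'`, so you can cross-check the long-hand RBF results above against it.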
Thank you for reading.
References
Schölkopf, Bernhard; Smola, Alexander; Müller, Klaus-Robert (December 1996). Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Technical report.
An overview of the approach by David R. Thompson at the California Institute of Technology (2014) is also available.